Sequence Alignments

Gepoliano Chaves

March 20th, 2024

Overview

After completing this session, students should be able to:

  • Practice global Needleman Wunsch algorithm in their notebook

  • Align a sequence

  • Download R

  • Download and align SARS-CoV-2 sequences from the State of Bahia, Brazil

Sequence alingnments: Global Alignment

1) Designing an algorithm: describe in human language

  • Describe what needs to be done in human language
  • What the problem is
    • For global alignment: to align two whole sequences against each other to evaluate how much similar they are
    • Compare nucleotide by nucleotide, what the identity match is or is not
    • Account for nucleotides that have the same or different identities

Sequence alingnments: Global Alignment

1) Designing an algorithm: describe in human language

Sequence alingnments: Global Alignment

1) Human language: Goals of Sequence Alignments

  • Goals of the alignment
    • Measure similarity
    • Observe patterns of sequence conservation between related biological species and variability of sequences over time and geographic location
    • Infer evolutionary relationships

Sequence alingnments: Global Alignment

ALGORITHM

Sequence alingnments: Global Alignment

1) Human language: Goals of Sequence Alignments

  • Steps:
    • Initialization
    • Matrix fill or scoring
    • Traceback and alignment

Sequence alingnments: Global Alignment

2) Pseudocode: Use equations to describe the calculations to be made in the algorithm

  • Rules:
    • Fill the first column and the last row with gap values
    • Value of box beside + Gap value
    • Value of box bottom + Gap value
    • Diagonal value + {match/mismatch}

Sequence alingnments: Global Alignment

Initialization

  • In your notebook, please create columns to align two sequences

Scoring Matrix. Figure from Bhawsar (2016)

Sequence alingnments: Global Alignment

Scoring: Filling the matrix

Scoring (filling) the matrix. Figure from Bhawsar (2016)

Sequence alingnments: Global Alignment

2) Pseudocode: Continuing the procedure

Scoring (filling) the matrix. Figure from Bhawsar (2016)

Sequence alingnments: Global Alignment

2) Pseudocode: Implementation (not using code) of the Needlemand-Wunsch Scoring Matrix

  • We could have used code to fill in this matrix

  • For the Traceback step, we follow the pointers (the arrows)

Scoring Matrix. Figure from Bhawsar (2016)

Sequence alingnments: Global Alignment

3) Traceback: alignment of the Needlemand-Wunsch Matrix

  • We could have used code to fill in this matrix

  • For the Traceback step, we follow the pointers (the arrows)

Complete traceback and alignment. Figure from Bhawsar (2016)

Sequence alingnments: Local Alignment

  • Smith-Waterman
    • Match +1, Mismatch -1, GAP penalty -2
    • https://www.youtube.com/watch?v=bFDRny7T3_s&t=3s
    • Query sequence vs. database sequence on a character to character level
    • Dinamic programming: divide problems into sub-problems for optimal solution
    • initalization, matrix filling and trace back

Sequence alingnments: Local Alignment

Sequence alignements: the scoring system

Sequence alignment: the scoring system.

Algorithm or Pipeline

  • An algorithm can also called a pipeline.

  • The first step to develop an algorithm is to objectively explain how to answer a question or solve a problem.

  • A Variant Calling algorithm identifies Variants or Mutations in the genome of an organism.

  • Also called SNP calling pipeline.

Algorithm or Pipeline

  • Here is the pseudocode of a SNP calling pipeline:

    • Align to reference sequence (FASTA)

    • Compare alignment to reference (SAM)

    • Annotate differences (mutations) (VCF)

    • Extract mutations from VCF using script

    • Construct a SNP Frequency Table

Algorithm or Pipeline

  • Schematic of a SNP call pipeline

    • The blue boxes indicates the analysis being performed.

    • The text above the boxes indicates the software used for each analysis. Figure is from r-charts (n.d.).

Schematic of a variant call pipeline.

Bioinformatics Software Development

  • Software development considers the analytical steps in human language

    • What are the exact steps that are necessary for execution of the analysis?
  • Then, the software product considers the steps the machine will execute

  • How files are produced and what are the processing steps?

  • Where in the computational infra-structure are the files stored?

Bioinformatics Software Development

Conclusions

  • We can develop our own computational methods to understand biology and propose solutions

  • In order to do that we need to follow these three steps for developing a computational algorithm that will solve a problem:

Bioinformatics Software Development

Conclusions

    1. Describe the problem in human language and propose solutions in ways that are inteligible to human collaborators
    1. Start using mathematical equations and figure out a computational language to write code to process data related to a problem
    • For example: the genetics of racial groups in Brazil, a population of mixed descent
    1. Write a script or code to run computational experiments that demonstrate possibilities to solve or address the problem

Bioinformatics Software Development

  • Let’s move on to describe the DNA sequencing methods

Multiple Alignments

  • In multiple sequences the alignment is much more significant than just two sequences

  • Score higher when multiple sequences align

  • The similarities refer to functional equivalence and evolutionary relationships between the two proteins

Multiple Sequence Alignment.

Multiple Alignments

  • Load the msa library
library(msa)
  • Read FASTA file and create DNAStringSet object
dna_sarsCov2_start_30000 <- readDNAStringSet(file="~/Desktop/Gepoliano/UFOB/sequencesfilogeniaData/bahia_combined_30000_muscle.fasta")
  • Visualize DNAStringSet object
dna_sarsCov2_start_30000

HMM

Bioinformatics Software Development (continued)

  • How can these files be accessed?

  • What information do the files containe?

  • The present program is about how a scientific question is answered, not what the final answer is

  • If how the question is answered is not addressed, opportunity is lost in terms of information that is embedded in the process of data analysis

  • This is an important notion to have when developing computational tools that answer a scientific question

References

Bhawsar, Harshita. 2016. Needleman-Wunch Algorithm. https://www.slideshare.net/HarshitaBhawsar/needlemanwunch-algorithm-harshita.
r-charts. n.d. SNP/Variant Calling Tutorial. Accessed March 31, 2024. https://docs.tinybio.cloud/docs/snpvariant-calling-tutorial.